Exploration server for computer science research in Lorraine

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Fault-Management in P2P-MPI

Internal identifier: 003957 (Main/Exploration); previous: 003956; next: 003958

Authors: Stéphane Genaud [France]; Emmanuel Jeannot [France]; Choopan Rattanapoka [Thailand]

Source :

RBID : ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6

English descriptors

Abstract

Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message-passing parallel programs, and its objective is to support environments using commodity hardware. Hence, program execution is failure-prone, and particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance deals with the program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.
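The relation between replication degree and failure probability mentioned in the abstract can be illustrated with a minimal sketch. It is not the model from the paper: it assumes independent host failures with an exponential lifetime, a fixed execution duration (the paper instead models how the duration varies with replication), and that the application fails only when every replica of some logical process is lost. All function and parameter names (`host_failure_prob`, `application_failure_prob`, `min_replication_degree`, `failure_rate`, `max_degree`) are hypothetical.

```python
import math


def host_failure_prob(failure_rate: float, duration: float) -> float:
    """Probability that one host fails within `duration`.

    Assumption: exponentially distributed host lifetime with rate `failure_rate`.
    """
    return 1.0 - math.exp(-failure_rate * duration)


def application_failure_prob(n_procs: int, degree: int,
                             failure_rate: float, duration: float) -> float:
    """Probability that at least one logical process loses all its replicas.

    Assumption: hosts fail independently and `duration` does not depend on `degree`.
    """
    p_host = host_failure_prob(failure_rate, duration)
    p_group_lost = p_host ** degree  # all `degree` replicas of one process fail
    return 1.0 - (1.0 - p_group_lost) ** n_procs


def min_replication_degree(n_procs: int, failure_rate: float, duration: float,
                           threshold: float, max_degree: int = 64) -> int:
    """Smallest degree whose application failure probability is <= `threshold`."""
    for degree in range(1, max_degree + 1):
        if application_failure_prob(n_procs, degree, failure_rate, duration) <= threshold:
            return degree
    raise ValueError("no replication degree up to max_degree meets the threshold")


if __name__ == "__main__":
    # Illustrative values only: 16 processes, one-hour run, made-up failure rate.
    r = min_replication_degree(16, 1e-4, 3600.0, 0.01)
    print(r, application_failure_prob(16, r, 1e-4, 3600.0))
```

The linear search is deliberate: under these assumptions the failure probability decreases monotonically with the degree, so the first degree that meets the threshold is the minimum.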

Url:
DOI: 10.1007/s10766-009-0115-8


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001594</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001594</idno>
<idno type="wicri:Area/Istex/Curation">001575</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A64</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A64</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00425516</idno>
<idno type="url">https://hal.inria.fr/inria-00425516</idno>
<idno type="wicri:Area/Hal/Corpus">006C19</idno>
<idno type="wicri:Area/Hal/Curation">006C19</idno>
<idno type="wicri:Area/Hal/Checkpoint">002A06</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">002A06</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:Area/Main/Merge">003A35</idno>
<idno type="wicri:Area/Main/Curation">003957</idno>
<idno type="wicri:Area/Main/Exploration">003957</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok</wicri:regionArea>
<wicri:noRegion>Bangkok</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Thaïlande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="ISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<imprint>
<publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01">2009-10-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
<idno type="ISSN">0885-7458</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0885-7458</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
<keywords scheme="mix" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message-passing parallel programs, and its objective is to support environments using commodity hardware. Hence, program execution is failure-prone, and particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance deals with the program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
<li>Thaïlande</li>
</country>
<region>
<li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
</list>
<tree>
<country name="France">
<region name="Grand Est">
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</region>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</country>
<country name="Thaïlande">
<noRegion>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</noRegion>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003957 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 003957 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6
   |texte=   Fault-Management in P2P-MPI
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022